AITopics | song generation

Collaborating Authors

song generation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SongCreator: Lyrics-based Universal Song Generation

Neural Information Processing SystemsMar-21-2026, 15:30:50 GMT

Music is an integral part of human culture, embodying human intelligence and creativity, of which songs compose an essential part. While various aspects of song generation have been explored by previous works, such as singing voice, vocal composition and instrumental arrangement, etc., generating songs with both vocals and accompaniment given lyrics remains a significant challenge, hindering the application of music generation models in the real world. In this light, we propose SongCreator, a song-generation system designed to tackle this challenge. The model features two novel designs: a meticulously designed dual-sequence language model (DSLM) to capture the information of vocals and accompaniment for song generation, and a series of attention mask strategies for DSLM, which allows our model to understand, generate and edit songs, making it suitable for various songrelated generation tasks by utilizing specific attention masks. Extensive experiments demonstrate the effectiveness of SongCreator by achieving state-of-the-art or competitive performances on all eight tasks.

artificial intelligence, proceedings, songcreator, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.76)

Add feedback

SongCreator: Lyrics-based Universal Song Generation

Neural Information Processing SystemsFeb-16-2026, 16:33:05 GMT

SongCreator, a song-generation system designed to tackle this challenge.

accompaniment, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

LeVo: High-Quality Song Generation with Multi-Preference Alignment

Lei, Shun, Xu, Yaoxun, Lin, Zhiwei, Zhang, Huaicheng, Tan, Wei, Chen, Hangting, Yu, Jianwei, Zhang, Yixuan, Yang, Chenyu, Zhu, Haina, Wang, Shuai, Wu, Zhiyong, Yu, Dong

arXiv.org Artificial IntelligenceOct-24-2025

Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language model based framework consisting of LeLM and Music Codec. LeLM is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.0752

Country: Asia > China (0.93)

Genre: Research Report > New Finding (0.87)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

92a7a03e1c716970848a4a86cc8243ee-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 09:50:43 GMT

accompaniment, songcreator, vocal and accompaniment, (15 more...)

Neural Information Processing Systems

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Hong Kong (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

SongPrep: A Preprocessing Framework and End-to-end Model for Full-song Structure Parsing and Lyrics Transcription

Tan, Wei, Lei, Shun, Zhang, Huaicheng, Li, Guangzheng, Zhang, Yixuan, Chen, Hangting, Yu, Jianwei, Gu, Rongzhi, Yu, Dong

arXiv.org Artificial IntelligenceSep-23-2025

Artificial Intelligence Generated Content (AIGC) is currently a popular research area. Among its various branches, song generation has attracted growing interest. Despite the abundance of available songs, effective data preparation remains a significant challenge. Converting these songs into training-ready datasets typically requires extensive manual labeling, which is both time consuming and costly. To address this issue, we propose SongPrep, an automated preprocessing pipeline designed specifically for song data. This framework streamlines key processes such as source separation, structure analysis, and lyric recognition, producing structured data that can be directly used to train song generation models. Furthermore, we introduce SongPrepE2E, an end-to-end structured lyrics recognition model based on pretrained language models. Without the need for additional source separation, SongPrepE2E is able to analyze the structure and lyrics of entire songs and provide precise timestamps. By leveraging context from the whole song alongside pretrained semantic knowledge, SongPrepE2E achieves low Diarization Error Rate (DER) and Word Error Rate (WER) on the proposed SSLD-200 dataset. Downstream tasks demonstrate that training song generation models with the data output by SongPrepE2E enables the generated songs to closely resemble those produced by humans.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2509.17404

Genre: Research Report (1.00)

Industry:

Media > Music (0.68)
Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.48)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.41)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.35)

Add feedback

Versatile Framework for Song Generation with Prompt-based Control

Zhang, Yu, Guo, Wenxiang, Pan, Changhao, Zhu, Zhiyuan, Li, Ruiqi, Lu, Jingyu, Huang, Rongjie, Zhang, Ruiyuan, Hong, Zhiqing, Jiang, Ziyue, Zhao, Zhou

arXiv.org Artificial IntelligenceAug-26-2025

Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks using objective and subjective metrics. Demos and codes are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.19062

Genre: Research Report > New Finding (0.48)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Speech (0.67)

Add feedback

Towards Hallucination-Free Music: A Reinforcement Learning Preference Optimization Framework for Reliable Song Generation

Zhang, Huaicheng, Tan, Wei, Li, Guangzheng, Zhang, Yixuan, Chen, Hangting, Lei, Shun, Yang, Chenyu, Wu, Zhiyong, Wang, Shuai, Huang, Qijun, Yu, Dong

arXiv.org Artificial IntelligenceAug-8-2025

Recent advances in audio-based generative language models have accelerated AI-driven lyric-to-song generation. However, these models frequently suffer from content hallucination, producing outputs misaligned with the input lyrics and undermining musical coherence. Current supervised fine-tuning (SFT) approaches, limited by passive label-fitting, exhibit constrained self-improvement and poor hallucination mitigation. To address this core challenge, we propose a novel reinforcement learning (RL) framework leveraging preference optimization for hallucination control. Our key contributions include: (1) Developing a robust hallucination preference dataset constructed via phoneme error rate (PER) computation and rule-based filtering to capture alignment with human expectations; (2) Implementing and evaluating three distinct preference optimization strategies within the RL framework: Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO). DPO operates off-policy to enhance positive token likelihood, achieving a significant 7.4% PER reduction. PPO and GRPO employ an on-policy approach, training a PER-based reward model to iteratively optimize sequences via reward maximization and KL-regularization, yielding PER reductions of 4.9% and 4.7%, respectively. Comprehensive objective and subjective evaluations confirm that our methods effectively suppress hallucinations while preserving musical quality. Crucially, this work presents a systematic, RL-based solution to hallucination control in lyric-to-song generation. The framework's transferability also unlocks potential for music style adherence and musicality enhancement, opening new avenues for future generative song research.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2508.05011

Country: Asia > China (0.68)

Genre: Research Report (0.82)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

Via Score to Performance: Efficient Human-Controllable Long Song Generation with Bar-Level Symbolic Notation

Wang, Tongxi, Yu, Yang, Wang, Qing, Qian, Junlang

arXiv.org Artificial IntelligenceAug-5-2025

Song generation is regarded as the most challenging problem in music AIGC; nonetheless, existing approaches have yet to fully overcome four persistent limitations: controllability, generalizability, perceptual quality, and duration. We argue that these shortcomings stem primarily from the prevailing paradigm of attempting to learn music theory directly from raw audio, a task that remains prohibitively difficult for current models. To address this, we present Bar-level AI Composing Helper (BACH), the first model explicitly designed for song generation through human-editable symbolic scores. BACH introduces a tokenization strategy and a symbolic generative procedure tailored to hierarchical song structure. Consequently, it achieves substantial gains in the efficiency, duration, and perceptual quality of song generation. Experiments demonstrate that BACH, with a small model size, establishes a new SOTA among all publicly reported song generation systems, even surpassing commercial solutions such as Suno. Human evaluations further confirm its superiority across multiple subjective metrics.

machine learning, natural language, song generation, (15 more...)

arXiv.org Artificial Intelligence

2508.01394

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Liu, Renhang, Hung, Chia-Yu, Majumder, Navonil, Gautreaux, Taylor, Bagherzadeh, Amir Ali, Li, Chuan, Herremans, Dorien, Poria, Soujanya

arXiv.org Artificial IntelligenceJul-29-2025

Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high quality and faithful audio outputs capturing to speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as, DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need or manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms the existing models in terms of the music-specific attributes.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.2088

Genre: Research Report (0.81)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Liu, Zihan, Ding, Shuangrui, Zhang, Zhixiong, Dong, Xiaoyi, Zhang, Pan, Zang, Yuhang, Cao, Yuhang, Lin, Dahua, Wang, Jiaqi

arXiv.org Artificial IntelligenceFeb-18-2025

Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The generated samples are showcased on our project page at https://liuzh-19.github.io/SongGen/ , and the code will be available at https://github.com/LiuZH-19/SongGen .

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2502.13128

Country: Asia > China (0.46)

Genre:

Research Report (0.51)
Workflow (0.47)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Information Technology (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback